Enable model caching for Whisper pipeline on GPU and NPU #2759
ov::AnyMap ov_config;
if (device == "NPU" || device.find("GPU") != std::string::npos) {  // need to handle cases like "GPU", "GPU.0" and "GPU.1"
    // Cache compiled models on disk for GPU and NPU to save time on the
    // next run. It's not beneficial for CPU.
Why is it not beneficial for CPU?
- This comment is simply copied from the reference sample code.
- AFAIK, the CPU plugin's "compile" step is mostly graph rewrites and primitive selection. It typically takes milliseconds to a few hundred ms, not seconds to minutes as on GPU/NPU.
- Most importantly, enabling model caching on CPU causes the Whisper pipeline to crash. This looks like a bug that needs further investigation, so for now model caching is enabled only on GPU and NPU to avoid the issue.
Pull Request Overview
Adds GPU/NPU model caching configuration to Whisper speech recognition sample code.
- Introduces helper to build caching config in both Python and C++ samples.
- Applies conditional logic to enable caching only on GPU/NPU devices.
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
File | Description |
---|---|
samples/python/whisper_speech_recognition/whisper_speech_recognition.py | Adds cache config helper and conditional passing of CACHE_DIR to WhisperPipeline. |
samples/cpp/whisper_speech_recognition/whisper_speech_recognition.cpp | Adds cache config helper and conditional AnyMap passed to WhisperPipeline. |
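The cache config helper mentioned in the table is not shown in this conversation. A minimal sketch of what it might look like in Python follows, assuming it sets only `CACHE_DIR` (the property named in the table). The default directory name `whisper_cache` and the wrapper `build_ov_config` are hypothetical names chosen for illustration, not taken from the PR.

```python
def get_config_for_cache(cache_dir: str = "whisper_cache") -> dict:
    """Build a properties dict asking the runtime to cache compiled models on disk."""
    # "CACHE_DIR" is the property named in the review summary; the default
    # directory name is a placeholder assumed for this sketch.
    return {"CACHE_DIR": cache_dir}


def build_ov_config(device: str) -> dict:
    """Return the cache config for GPU/NPU and an empty config otherwise."""
    # Same condition as the sample under review: exact "NPU", or any
    # device string containing "GPU" (e.g. "GPU", "GPU.0", "GPU.1").
    if device == "NPU" or "GPU" in device:
        return get_config_for_cache()
    return {}
```

The resulting dict would then be handed to the `WhisperPipeline` constructor alongside the model path and device; how exactly it is passed (for example, unpacked as keyword arguments) is left to the sample itself.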
ov_config = dict()
if args.device == "NPU" or "GPU" in args.device:  # need to handle cases like "GPU", "GPU.0" and "GPU.1"
    # Cache compiled models on disk for GPU and NPU to save time on the
    # next run. It's not beneficial for CPU.
    ov_config = get_config_for_cache()
Copilot
AI
Oct 16, 2025
The condition handles GPU variants (e.g. GPU.0) but will skip NPU variants such as 'NPU.0', limiting caching despite the PR goal to enable it for NPU. Update the condition to also match NPU suffixed forms, e.g.: if 'GPU' in args.device or args.device.startswith('NPU'):. Alternatively use substring checks for both: if 'GPU' in args.device or 'NPU' in args.device:.
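The difference between the two conditions can be sketched as a pair of small predicates. The function names are hypothetical, and whether suffixed names such as `NPU.0` actually occur depends on the runtime; the sketch only compares the string matching.

```python
def caches_original(device: str) -> bool:
    # Condition as written in the sample: exact "NPU" or any string containing "GPU".
    return device == "NPU" or "GPU" in device


def caches_suggested(device: str) -> bool:
    # Substring check for both device families, as the review comment proposes.
    return "GPU" in device or "NPU" in device


for name in ("CPU", "GPU", "GPU.1", "NPU", "NPU.0"):
    print(f"{name}: original={caches_original(name)}, suggested={caches_suggested(name)}")
```

The two predicates agree on every name in the loop except `NPU.0`, which only the substring version matches.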
ov::AnyMap ov_config;
if (device == "NPU" || device.find("GPU") != std::string::npos) {  // need to handle cases like "GPU", "GPU.0" and "GPU.1"
    // Cache compiled models on disk for GPU and NPU to save time on the
    // next run. It's not beneficial for CPU.
    ov_config = get_config_for_cache();
}

ov::genai::WhisperPipeline pipeline(models_path, device, ov_config);
Copilot
AI
Oct 16, 2025
The condition enables caching for GPU variants but misses NPU variants like 'NPU.0', restricting caching contrary to the stated intent. Adjust to also detect NPU substrings: if (device.find("GPU") != std::string::npos || device.find("NPU") != std::string::npos) { ... }.
Whisper sample code to enable model caching on GPU and NPU
This is a follow-up to #2751.
Sample Code Reference:
https://github.com/openvinotoolkit/openvino.genai/blob/master/samples/python/visual_language_chat/encrypted_model_vlm.py#L87
https://github.com/openvinotoolkit/openvino.genai/blob/master/samples/cpp/text_generation/encrypted_model_causal_lm.cpp#L52
OPTIMIZE_SIZE and encryption are not included. The main performance concern for Whisper is pipeline speed. Since Whisper is much smaller than LLMs, size optimization offers very little savings while potentially adding latency. Similarly, model encryption can introduce additional latency.